Last compiled: 2021-09-08
Problem Statement
An organization wants to know which companies are similar to each other so that it can identify potential customers for a SaaS product (e.g. Salesforce CRM or an equivalent) across market segments. The Sales Department is particularly interested in this analysis because it will help them penetrate those segments more easily.
I will be using stock prices in this analysis. Companies will be grouped by how their stocks trade, measured by their daily stock returns (the percentage change in price from one trading day to the next). The resulting groups will help the organization determine which companies are related to each other, that is, which are likely competitors or share similar attributes.
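To make the definition of a daily return concrete, here is a minimal sketch on a toy table. The column names (symbol, date, adjusted) and all prices are made up for illustration and are not taken from the data sets used in this analysis.

```r
library(dplyr)
library(tibble)

# Toy prices (made-up numbers, for illustration only)
prices_tbl <- tribble(
  ~symbol, ~date,        ~adjusted,
  "AAA",   "2021-01-04", 100,
  "AAA",   "2021-01-05", 102,
  "AAA",   "2021-01-06",  51,
  "BBB",   "2021-01-04",  50,
  "BBB",   "2021-01-05",  55
) %>%
  mutate(date = as.Date(date))

# Daily return = today's price / previous trading day's price - 1,
# computed separately within each symbol
daily_returns_tbl <- prices_tbl %>%
  group_by(symbol) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(pct_return = adjusted / dplyr::lag(adjusted) - 1) %>%
  ungroup() %>%
  filter(!is.na(pct_return)) # the first day of each symbol has no prior price

daily_returns_tbl
```

Returns rather than raw prices are compared because they put a $50 stock and a $500 stock on the same scale.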
I will analyze the returns with two unsupervised learning tools: kmeans() to find groups of companies and umap() to visualize how similar their daily stock returns are.
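As a rough illustration of how the two functions fit together, the sketch below runs both on a synthetic matrix. The matrix (60 made-up "companies" by 20 "trading days"), the choice of 3 centers, and the random_state value are assumptions for this example only, not the settings used in the analysis.

```r
library(umap)

set.seed(123)

# Synthetic stand-in for a company-by-day matrix of returns:
# three groups of 20 rows, each with a different average return
returns_mat <- rbind(
  matrix(rnorm(20 * 20, mean = -0.05, sd = 0.01), nrow = 20),
  matrix(rnorm(20 * 20, mean =  0.00, sd = 0.01), nrow = 20),
  matrix(rnorm(20 * 20, mean =  0.05, sd = 0.01), nrow = 20)
)
rownames(returns_mat) <- paste0("CO", seq_len(nrow(returns_mat)))

# K-Means assigns each row (company) to one of 3 clusters
kmeans_obj <- kmeans(returns_mat, centers = 3, nstart = 20)

# UMAP projects the 20-dimensional rows onto 2 dimensions for plotting
umap_config <- umap.defaults
umap_config$random_state <- 123
umap_obj <- umap(returns_mat, config = umap_config)

# Combine cluster labels with the 2-D coordinates
result_df <- data.frame(
  company = rownames(returns_mat),
  cluster = factor(kmeans_obj$cluster),
  x       = umap_obj$layout[, 1],
  y       = umap_obj$layout[, 2]
)
head(result_df)
```

The two steps are independent: K-Means supplies the group labels, while UMAP only supplies coordinates, so coloring the UMAP points by cluster shows whether the two methods agree.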
Goal
The goal is to apply my knowledge of K-Means and UMAP, along with dplyr, ggplot2, and purrr, to create a visualization that identifies subgroups in the S&P 500 Index. Specifically, I will be applying the following:
kmeans() and umap()
purrr
dplyr, tidyr, and tibble
ggplot2 (bonus plotly)
As a first step, load the tidyverse, tidyquant, broom, umap, and colorspace libraries. The comments in the code block below describe what each library offers.
# STEP 1: Load Libraries ---
# install.packages("plotly")
# Tidy, Transform, & Visualize
library(tidyverse)
# library(tibble) --> is a modern re-imagining of the data frame
# library(readr) --> provides a fast and friendly way to read rectangular data like csv
# library(dplyr) --> provides a grammar of data manipulation
# library(purrr) --> provides functional programming tools such as map()
# (the %>% pipe operator from magrittr is re-exported by dplyr, making code more readable)
# library(tidyr) --> provides a set of functions that help you get to tidy data
# library(stringr) --> provides a cohesive set of functions designed to make working with strings as easy as possible
# library(ggplot2) --> graphics
library(tidyquant) # Bringing business and financial analysis to the 'tidyverse'
library(broom) # Turns messy model output into tidy tibbles
library(umap) # Uniform manifold approximation and projection - dimension reduction technique
library(ggplot2) # Plotting and themes (already attached by tidyverse; loading it again is harmless)
library(colorspace) # For better selection of colors
If you haven’t installed these packages, install them by calling install.packages("name_of_package") in the R console, then run the above code block again.
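If you would rather check for missing packages in one go, a small helper along these lines may be convenient. The function name find_missing() is hypothetical, and the install call is left commented out so that nothing is installed without your say-so.

```r
# Packages loaded in the code block above
required_pkgs <- c("tidyverse", "tidyquant", "broom", "umap", "colorspace")

# Hypothetical helper: returns the packages that cannot be found locally.
# Note that requireNamespace() loads (but does not attach) each package it finds.
find_missing <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing_pkgs <- find_missing(required_pkgs)
missing_pkgs

# Uncomment to install whatever is missing:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```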
Although an API can be used to retrieve stock prices, I am providing the prices for every stock in the S&P 500 index. I will be working with two data sets, sp_500_prices_tbl and sp_500_index_tbl (the source of the raw data is linked below). You may download the data if you want to try this code on your own.